arXiv: 1502.04071 v2 [cs.IT] 16 Nov 2015 


THE GENERALIZED LASSO WITH NON-LINEAR 
OBSERVATIONS 

Abstract. We study the problem of signal estimation from non-linear
observations when the signal belongs to a low-dimensional set buried
in a high-dimensional space. A rough heuristic often used in practice
postulates that non-linear observations may be treated as noisy linear
observations, and thus the signal may be estimated using the generalized
Lasso. This is appealing because of the abundance of efficient,
specialized solvers for this program. Just as noise may be diminished
by projecting onto the lower dimensional space, the error from modeling
non-linear observations with linear observations will be greatly reduced
when using the signal structure in the reconstruction. We allow general
signal structure, only assuming that the signal belongs to some set
$K \subset \mathbb{R}^n$. We consider the single-index model of non-linearity. Our
theory allows the non-linearity to be discontinuous, not one-to-one and
even unknown. We assume a random Gaussian model for the measurement
matrix, but allow the rows to have an unknown covariance matrix.
As special cases of our results, we recover near-optimal theory for noisy
linear observations, and also give the first theoretical accuracy guarantee
for 1-bit compressed sensing with unknown covariance matrix of the
measurement vectors.


1. Introduction 

Before describing the non-linear setting, which is the main theme of
this paper, let us first consider the structured linear model

$$y = Ax + z$$

where an unknown vector $x$ belongs to some known set $K \subset \mathbb{R}^n$. The goal
is to reconstruct the signal $x$ from the noisy measurement vector $y \in \mathbb{R}^m$. A
common method is to minimize the $\ell_2$ loss subject to a structural constraint:

$$\text{minimize } \|Ax' - y\|_2 \quad \text{subject to } x' \in K. \tag{1.1}$$

We shall refer to this generalized Lasso as the K-Lasso for the rest of the
paper. The set $K$ is meant to capture structure of the signal. In many cases
of interest $K$ behaves as if it were a low-dimensional set, although it often
has full linear algebraic dimension. For example, to promote sparsity of the
solution, one can choose $K$ to be a scaled $\ell_1$ ball, and this gives the vanilla

Date: November 17, 2015. 

R. V. is partially supported by NSF grant 1265782 and U.S. Air Force grant FA9550- 
14-1-0009. 




YANIV PLAN AND ROMAN VERSHYNIN 


Lasso as proposed by R. Tibshirani [44]. When the signals are matrices, to 
promote low rank one can choose K to be a scaled ball in the nuclear norm, 
and this is referred to as the matrix Lasso [9] or trace Lasso [22]. 

How well can the signal be reconstructed based on the complexity of
the set $K$? Under the linear model, the last two decades have seen the
development of a strong theoretical backing for the Lasso from the statistical 
community, mostly based on a sparsity assumption. See, e.g., [8, 23, 6, 29, 
32, 46, 10]. Further, recent results developed from the compressed sensing 
community give a clean, comprehensive theory for arbitrary signal structure. 
See Section 2. 

Consider the more challenging situation, in which there is an unknown 
non-linearity in the observations. We ask: 

What happens when the K-Lasso is used to reconstruct a 

signal based on non-linear observations? 

On the one hand, Lasso is by design a method for linear regression, and it 
is dubious to expect it to work if y depends non-linearly on Ax. On the 
other hand, practitioners have been successfully using Lasso for non-linear 
(especially binary) observations without theoretical backing. 

In this paper we demonstrate that the K-Lasso can be used for non-linear
observations. We will see that from Lasso's point of view, non-linear
observations behave as scaled and noisy linear observations, and we will
characterize the scaling and the noise. Furthermore, we assume $A$ to be
Gaussian, but in contrast to much of the literature, we allow unknown
covariance of rows. A particular non-linearity of interest in signal
processing is 1-bit quantization, which, when combined with sparse signal
structure, leads to the model of 1-bit compressed sensing. We believe all
previous theoretical results in this area have required knowledge of the
covariance of rows for the recovery algorithm to be accurate; our work
broadens the theory by removing this requirement. We will describe related
literature regarding non-linear observations in Section 2 below.

1.1. Model. We will work with a semiparametric single-index model of a
form similar to the one in [37]. Let $x \in K \subset \mathbb{R}^n$ be a fixed (unknown) signal
vector, let $a_i \sim \mathcal{N}(0, \Sigma)$ be independent random measurement vectors, and
let $A$ be the matrix whose $i$-th row is $a_i^T$. Let $f_i : \mathbb{R} \to \mathbb{R}$ be independent
copies of an unknown, random function $f$ modeling the non-linearity (it
may also be deterministic), which are independent of $A$. We assume that the $m$
observations $y_i$ that form the vector $y = (y_1, \ldots, y_m)$ take the form

$$y_i = f_i(\langle a_i, x\rangle). \tag{1.2}$$

Note that the norm of $x$ is sacrificed in this model since it may be absorbed
into the unknown random function $f_i$. Thus, to simplify presentation, we
will assume that $\|\sqrt{\Sigma}\,x\|_2 = 1$. We will remark on how to remove this
assumption by a rescaling argument.

1.2. Examples. We now give two concrete examples of the above model:
quantized and binary observations.

A first non-linearity of interest is quantization applied to linear
observations. Then the function $f$ maps $\langle a_i, x\rangle$ to a finite alphabet of real numbers.
In this case, the non-linearity is known, and furthermore, it is designed.
Thus, the theoretical error bounds we develop below may be tuned to
optimize the error. This observation has been made in [43].

On the extreme end, one may consider 1-bit quantization:
$f(\langle a_i, x\rangle) = \mathrm{sign}(\langle a_i, x\rangle)$. Measurements of this kind are of special interest due to the
simplicity of hardware implementation, and the robustness to multiplicative
errors. We further discuss 1-bit quantization in Section 3 below.

Interestingly, binary statistical models are quite similar. For example,
$f(\langle a_i, x\rangle) = \mathrm{sign}(\langle a_i, x\rangle + z_i)$ gives the logistic regression model, provided
that $z_i$ is logit noise. Other binary models are available by adjusting the
distribution of $z_i$. The classical approach in these models is (regularized)
maximum likelihood estimation [32, 14]. However, it requires knowledge
of the form of the non-linearity, which is equivalent to knowledge of the
distribution of $z_i$, and in practice one would often not expect this to be
known. Further, the theory requires the log-likelihood to be strongly convex,
which ceases to hold when $z_i$ is small compared to $\|x\|_2$. Ironically, the noise
needs to be roughly larger than the signal in the theoretical treatment of
maximum-likelihood estimation (see [14] for a discussion of this point). In
contrast, as we show, the K-Lasso does not need knowledge of the
non-linearity, and is accurate even when the noise $z_i$ disappears, as in the 1-bit
compressed sensing model.
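
To make the two binary examples concrete, the following is a minimal simulation sketch of the model (1.2) with identity covariance. The helper names `simulate_single_index`, `one_bit` and `logistic_binary`, the dimensions and the particular signal are illustrative assumptions, not part of the paper.

```python
import numpy as np

def simulate_single_index(x, m, f, rng):
    """Draw m observations y_i = f(<a_i, x>) with a_i ~ N(0, I) (assumed covariance)."""
    A = rng.standard_normal((m, x.shape[0]))  # rows are the measurement vectors a_i
    return A, f(A @ x, rng)

def one_bit(u, rng):
    # 1-bit quantization: only the sign of each linear observation is kept
    return np.sign(u)

def logistic_binary(u, rng):
    # sign(u + z) with z drawn from the standard logistic distribution
    return np.sign(u + rng.logistic(size=u.shape))

rng = np.random.default_rng(0)
x = np.zeros(50)
x[:3] = 1 / np.sqrt(3)        # hypothetical unit-norm signal
A, y_bits = simulate_single_index(x, 200, one_bit, rng)
_, y_logit = simulate_single_index(x, 200, logistic_binary, rng)
```

Both observation vectors take values in {-1, +1}; a recovery procedure that does not know which of the two functions generated the data sees only `A` and `y`.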


1.3. Simplified results when K is a subspace. To begin in a simpler
setting, let us assume that the covariance matrix $\Sigma$ is the identity, $K$ is a
$d$-dimensional subspace, and there is no non-linearity, just an unknown
rescaling and noise. Thus, we assume that $f_i(u) = \mu u + z_i$ for $u \in \mathbb{R}$, where $\mu > 0$
and $z_i \sim \mathcal{N}(0, \sigma^2)$. Then the observations take the form

$$y_i = \mu\langle a_i, x\rangle + z_i. \tag{1.3}$$

The K-Lasso (1.1) becomes the least squares estimator, whose behavior is
well known. Let $\hat{x}$ be the solution to the K-Lasso. Then, the conditional
expectation of the squared error with respect to $A$ satisfies

$$\mathbb{E}\,\|\hat{x} - \mu x\|_2^2 = \sigma^2 \sum_{i=1}^{d} \frac{1}{\sigma_i(A_K)^2},$$

where $\sigma_i(A_K)$ is the $i$-th singular value of $A$ restricted to the subspace $K$.
Since $A$ is Gaussian, it is well conditioned with high probability as long as
the number of observations $m$ is significantly larger than the dimension $d$
of $K$ [48]. In this case, with high probability, each singular value does not
deviate significantly from $\sqrt{m}$ [48] and thus

$$\mathbb{E}\,\|\hat{x} - \mu x\|_2^2 \approx \frac{d\sigma^2}{m}.$$
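
The least-squares calculation above can be checked numerically. The sketch below draws the noisy linear model (1.3) on a random subspace and compares the average squared error with $d\sigma^2/m$; the dimensions, the values of $\mu$ and $\sigma$, and the number of trials are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 40, 10, 500             # ambient dim, subspace dim, observations
mu, sigma, trials = 2.0, 0.5, 200

# Orthonormal basis U of a random d-dimensional subspace K of R^n
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
c = rng.standard_normal(d)
c /= np.linalg.norm(c)            # coordinates of x in K, so x = U c has unit norm
x = U @ c

sq_errors = []
for _ in range(trials):
    A = rng.standard_normal((m, n))
    y = mu * (A @ x) + sigma * rng.standard_normal(m)    # model (1.3)
    # On a subspace, the K-Lasso is least squares over coordinates in the basis U
    c_hat, *_ = np.linalg.lstsq(A @ U, y, rcond=None)
    sq_errors.append(np.sum((U @ c_hat - mu * x) ** 2))

empirical = np.mean(sq_errors)    # Monte Carlo estimate of E ||x_hat - mu x||_2^2
predicted = d * sigma**2 / m      # heuristic value from the singular-value argument
print(empirical, predicted)
```

The two printed numbers should agree up to a factor close to one when $m$ is much larger than $d$.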


Let us make a few observations about the ingredients involved in the
above calculation. First, the K-Lasso gives an estimate of a scaled version
of $x$. Second, note the vital requirement that the number of observations
$m$ exceeds the dimension of the subspace $d$. Third, observe that the size of
the scaling and the noise satisfy

$$\mu = \mathbb{E}(f(g)\cdot g) \quad\text{and}\quad \sigma^2 = \mathbb{E}(f(g) - \mu g)^2 = \mathbb{E} f(g)^2 - \mu^2,$$

where $g$ is a standard normal random variable.

Our main result states that up to a small extra summand, the K-Lasso
gives the same accuracy for non-linear observations, with $\sigma$ and $\mu$ measured
in the same way. To easily compare, we first state this result when $K$ is
a subspace. Here and in the rest of the paper, a statement is said to hold
with high probability if it holds with probability at least 0.99. Further, the
symbol $\lesssim$ hides an absolute constant.


Proposition 1.1 (Non-linear estimation on a subspace). Suppose that
$a_i \sim \mathcal{N}(0, I)$, and that $y$ follows the semi-parametric single index model of Section
1.1. Let $K$ be a $d$-dimensional subspace and assume $x \in K \cap S^{n-1}$. Suppose
that

$$m \gtrsim d.$$

Then, with high probability, the non-linear estimator $\hat{x}$ which minimizes the
K-Lasso (1.1) satisfies

$$\|\hat{x} - \mu x\|_2 \lesssim \frac{\sqrt{d}\,\sigma + \eta}{\sqrt{m}}, \tag{1.4}$$

where

$$\mu := \mathbb{E}(f(g)\cdot g), \quad \sigma^2 := \mathbb{E}(f(g) - \mu g)^2, \quad \eta^2 := \mathbb{E}(f(g) - \mu g)^2 g^2. \tag{1.5}$$

One sees that this mirrors the result for linear observations aside from the
extra summand $\eta/\sqrt{m}$, which becomes quite small with a moderate number
of observations $m$. For example, in the noisy linear model (1.3) one has
$\eta = \sigma$, so this result gives the classic error rate as a special case.
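
For a given non-linearity, the three parameters in (1.5) are one-dimensional Gaussian expectations and can be estimated by simple Monte Carlo. The sketch below does this for a deterministic $f$; the function name `index_parameters` and the sample size are illustrative assumptions. For $f = \mathrm{sign}$ the closed forms are $\mu = \sqrt{2/\pi}$ and $\sigma^2 = \eta^2 = 1 - 2/\pi$, which the estimates should approach.

```python
import numpy as np

def index_parameters(f, rng, num_samples=1_000_000):
    """Monte Carlo estimates of (mu, sigma^2, eta^2) from (1.5) for a deterministic f."""
    g = rng.standard_normal(num_samples)
    fg = f(g)
    mu = np.mean(fg * g)                 # mu = E f(g) g
    resid = fg - mu * g                  # f(g) - mu g, the "induced noise"
    sigma2 = np.mean(resid**2)           # sigma^2 = E (f(g) - mu g)^2
    eta2 = np.mean(resid**2 * g**2)      # eta^2 = E (f(g) - mu g)^2 g^2
    return mu, sigma2, eta2

rng = np.random.default_rng(3)
mu, sigma2, eta2 = index_parameters(np.sign, rng)
print(mu, sigma2, eta2)
```

A random $f$ (for example, one with additive noise inside the sign) would need its own randomness drawn inside `f`; that variant is not shown here.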

Results of the above flavour have been rigorously proven in the statistics 
literature [7], with a focus on asymptotic behaviour of the error. In this 
paper, we extend these ideas to modern trends in signal processing and 
statistics, in which it is assumed that the signal belongs to some non-linear 
low-dimensional signal structure, such as the set of sparse vectors or low- 
rank matrices. We now proceed to our main results in which K will be 
allowed to be a general set. 






1.4. Main results. We will give two results below, one specialized to the
case when the scaled signal $\mu x$ lies at an extreme point of $K$ with (small)
tangent cone, and one which only assumes that $\mu x$ lies in $K$.

Definition 1.2 (Tangent cone). The tangent cone$^1$ of $K$ at $x$ is

$$D(K, x) := \{\tau h : \tau \ge 0,\ h \in K - x\}.$$

For sets with non-smooth boundary, such as the $\ell_1$ ball or the nuclear
norm ball, the tangent cone at a boundary point can be quite narrow, and
intuitively should behave like a low-dimensional subspace. We give an
illustrative example of a tangent cone in Figure 1, although, in a two-dimensional
representation, we cannot do justice to the high-dimensional effects which
allow convex sets to have extremely narrow tangent cones.

While A may be singular, it can be quite well conditioned when restricted 
to the tangent cone; it is not surprising that this restricted conditioning of A 
can determine the accuracy of the solution to (1.1). Further, this restricted 
condition number can be well understood via Gordon’s escape through the 
mesh theorem (see Theorem 4.2). It states that the restriction of A onto K 
is well conditioned provided that the number of observations m exceeds the 
effective dimension of K. The effective dimension is measured in Gordon’s 
theorem by the notion of Gaussian mean width. Let us recall the notion of 
the local (Gaussian) mean width; see [36, 37, 48] for further discussion of 
the mean width and how it serves as a measure of effective dimension.

Definition 1.3 (Local mean width). The local mean width of a subset
$K \subset \mathbb{R}^n$ is a function of scale $t > 0$ defined as

$$w_t(K) = \mathbb{E} \sup_{x \in K \cap tB_2} \langle x, g\rangle,$$

where $B_2$ denotes the unit Euclidean ball in $\mathbb{R}^n$.

$^1$To allow non-convex $K$, the above is a slight variation on the standard definition of
tangent cone [26]. The tangent cone may also be called the descent cone.







Let us pause to explain the heuristic meaning of the local mean width of
a cone $D$. The square of the mean width, $w_1(D)^2$, can be described as a
measure of effective dimension of $D$. This can be seen in the following two
examples. First, let $D$ be a $d$-dimensional subspace in $\mathbb{R}^n$. It is not difficult
to check that

$$w_1(D)^2 \sim d,$$

up to absolute multiplicative constants. Thus in this case, the square of
the mean width is equivalent to the algebraic dimension $d$.

A deeper example is where $D = D(B_1^n, x)$ is the descent cone of the unit
$\ell_1$ ball $B_1^n = \{u \in \mathbb{R}^n : \|u\|_1 \le 1\}$ at some point $x$ on the boundary of
$B_1^n$. Suppose $x$ is $s$-sparse, meaning that $x$ has $s$ non-zero coordinates. It
should be clear that the smaller the sparsity $s$, the thinner the descent cone $D$
is. Quantitatively, this is captured by the notion of local mean width, which
can be shown (see e.g. [11]) to behave as follows:

$$w_1(D)^2 \sim s\log(n/s).$$

Thus, up to a logarithmic factor, the square of the mean width is again
equivalent to the dimensionality of the signal $x$, which is its sparsity $s$.

We refer the reader to [36, Section 2] where the notion of mean width 
is discussed in more detail, as well as to [4] where an equivalent concept of 
statistical dimension is introduced. 
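
The subspace example can be checked numerically. For a subspace $D$ with orthonormal basis $U$, the supremum of $\langle x, g\rangle$ over unit-ball vectors $x \in D$ equals $\|U^T g\|_2$, so $w_1(D)$ is the expected norm of a $d$-dimensional standard Gaussian vector. The sketch below estimates it by Monte Carlo; the function name `subspace_mean_width` and the dimensions are illustrative assumptions.

```python
import numpy as np

def subspace_mean_width(U, rng, num_samples=20000):
    """Monte Carlo estimate of w_1(D) for D = column span of an orthonormal U."""
    g = rng.standard_normal((num_samples, U.shape[0]))
    # sup of <x, g> over x in D with ||x||_2 <= 1 is attained at the
    # normalized projection of g onto D, giving ||U^T g||_2
    return np.mean(np.linalg.norm(g @ U, axis=1))

rng = np.random.default_rng(2)
n, d = 100, 20
U, _ = np.linalg.qr(rng.standard_normal((n, d)))   # random d-dimensional subspace
width_sq = subspace_mean_width(U, rng) ** 2
print(width_sq, d)
```

The squared estimate lands slightly below $d$, consistent with $w_1(D)^2 \sim d$. The descent cone of the $\ell_1$ ball has no such closed form for the supremum and is not covered by this sketch.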

Let us first state our first main result specialized to the case when $\Sigma = I$
and to descent-cone structure.

Theorem 1.4 (Non-linear estimation with tangent cone structure). Suppose
that $a_i \sim \mathcal{N}(0, I)$, $x \in S^{n-1}$, and that $y$ follows the semi-parametric
single index model of Section 1.1. Assume that $\mu x \in K$, and let
$d(K) := w_1(D(K, \mu x))^2$. Suppose that

$$m \gtrsim d(K).$$

Then, with high probability, the solution $\hat{x}$ of the K-Lasso (1.1) satisfies

$$\|\hat{x} - \mu x\|_2 \lesssim \frac{\sqrt{d(K)}\,\sigma + \eta}{\sqrt{m}}, \tag{1.6}$$

where $\mu$, $\eta$, and $\sigma$ are defined in (1.5).

It should be clear that this result extends Proposition 1.1 from linear to
non-linear observations, and from subspaces to general sets. To see this,
recall our observation that if $K$ is a $d$-dimensional subspace, then $d(K) \sim d$
up to an absolute constant factor.

Remark 1.5 (Boundary of K). For the above theorem to be especially useful,
$\mu x$ needs to lie on the boundary of $K$. Otherwise, the tangent cone is the
entire $\mathbb{R}^n$, and the effective dimension $d(K)$ is of order $n$. In this case,
the estimate becomes accurate only when the number of observations $m$
exceeds the ambient dimension $n$ rather than the effective dimension of
the cone, which may be significantly smaller. Thus, in practice, one would
like to rescale $K$ to put $\mu x$ on the boundary. If $\mu x$ does not lie precisely
on the boundary, we may appeal to our more general Theorem 1.9 below.
Further, we note that the unconstrained version of the K-Lasso overcomes
this obstacle. This has been proven in the asymptotic setting in [43], which
built upon the ideas in this paper.

A substitution argument generalizes the above result to allow an unknown 
covariance matrix. 

Corollary 1.6 (Non-linear estimation with unknown covariance matrix).
Suppose that $a_i \sim \mathcal{N}(0, \Sigma)$, $\sqrt{\Sigma}\,x \in S^{n-1}$, and that $y$ follows the
semi-parametric single index model of Section 1.1. Assume that $\mu x \in K$, and let
$d(K, \Sigma) := w_1(\sqrt{\Sigma}\,D(K, \mu x))^2$. Suppose that

$$m \gtrsim d(K, \Sigma).$$

Then, with high probability, the non-linear estimator $\hat{x}$ which minimizes the
K-Lasso (1.1) satisfies

$$\|\sqrt{\Sigma}\,(\hat{x} - \mu x)\|_2 \lesssim \frac{\sqrt{d(K, \Sigma)}\,\sigma + \eta}{\sqrt{m}}, \tag{1.7}$$

where $\mu$, $\eta$, and $\sigma$ are defined in (1.5).

Proof. We may set $a_i := \sqrt{\Sigma}\,g_i$ where $g_i \sim \mathcal{N}(0, I)$. Then
$\langle a_i, x\rangle = \langle g_i, \sqrt{\Sigma}\,x\rangle$. Thus, by replacing $x$ with $\sqrt{\Sigma}\,x$, we recover the model in which
$\Sigma = I$. Further, we may substitute $x'$ with $\sqrt{\Sigma}\,x'$ in the K-Lasso to arrive
at the $\sqrt{\Sigma}K$-Lasso:

$$\text{minimize } \|Gx' - y\|_2 \quad \text{subject to } x' \in \sqrt{\Sigma}\,K, \tag{1.8}$$

where $G$ is the matrix which contains $g_i^T$ as its $i$-th row. We have now
completely reduced to the setup of Theorem 1.4, with the caveat that we have
substituted $x$, $x'$, and $K$ by $\sqrt{\Sigma}\,x$, $\sqrt{\Sigma}\,x'$, and $\sqrt{\Sigma}\,K$. Apply the theorem to
finish the proof of the corollary. □

Remark 1.7 (Removing $\Sigma$ from the mean width). If the covariance matrix
$\Sigma$ is well conditioned, its effect on the error (1.7) can be easily evaluated
using the inequality

$$d(K, \Sigma) \le \mathrm{cond}(\Sigma) \cdot d(K), \tag{1.9}$$

where $\mathrm{cond}(\Sigma) = \|\Sigma\| \cdot \|\Sigma^{-1}\|$ denotes the condition number and
$d(K) = d(K, I)$ is the same as in Theorem 1.4. Before we prove this bound, let us
mention that in some situations the effect of $\Sigma$ is much smaller than it
predicts: for example, if $K$ is a subspace, then $d(K, \Sigma) = d(K)$.

To check (1.9), note that for the tangent cone $D = D(K, \mu x)$ we have

$$w_1(\sqrt{\Sigma}\,D) = \mathbb{E} \sup_{x \in \sqrt{\Sigma}D \cap B_2} \langle x, g\rangle \le \|\sqrt{\Sigma^{-1}}\| \cdot \mathbb{E} \sup_{x \in \sqrt{\Sigma}(D \cap B_2)} \langle g, x\rangle, \tag{1.10}$$

where the inequality follows from the elementary containment
$\sqrt{\Sigma}\,D \cap B_2 \subset \|\sqrt{\Sigma^{-1}}\| \cdot \sqrt{\Sigma}\,(D \cap B_2)$. A straightforward application of Slepian's inequality
[48] then bounds the quantity in (1.10) by $\|\sqrt{\Sigma^{-1}}\| \cdot \|\sqrt{\Sigma}\| \cdot w_1(D)$. Thus,
we conclude (1.9).

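Remark 1.8 (Removing the assumption that $\|\sqrt{\Sigma}\,x\|_2 = 1$). The theory may
be generalized to the case when $\|\sqrt{\Sigma}\,x\|_2 \ne 1$ with a simple rescaling argument.
Let $\delta = \|\sqrt{\Sigma}\,x\|_2$ and let $\bar{x} := x/\delta$. Observe that

$$f(\langle a_i, x\rangle) = f(\delta\langle a_i, \bar{x}\rangle) =: \bar{f}(\langle a_i, \bar{x}\rangle).$$

Thus, the theorem applies to the estimation of $\bar{x}$ with parameters

$$\bar{\mu} := \mathbb{E}[\bar{f}(g) \cdot g], \quad \bar{\sigma}^2 := \mathbb{E}(\bar{f}(g) - \bar{\mu} g)^2, \quad \bar{\eta}^2 := \mathbb{E}(\bar{f}(g) - \bar{\mu} g)^2 g^2.$$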

In some cases, one does not expect the tangent cone to have especially
small mean width. As a motivating example, in the field of compressed
sensing, it is standard to call $x$ compressible if it belongs to a scaled $\ell_p$
ball for $p \in (0, 1)$, or if the ratio $\|x\|_1/\|x\|_2$ is small. In this case, which
contrasts with the case of exact sparsity, the tangent cone may have mean
width comparable to the ambient dimension. However, the set $K$ itself can
still behave in a low-dimensional fashion. Since $K$ is not necessarily a cone,
and is not scale invariant, it is necessary to characterize dimension with a
scaling parameter. Fortunately, the local mean width accomplishes this task
with $t$ as the scaling parameter, and $w_t(K - \mu x)^2/t^2$ serving as a measure
of the dimension at scale $t$.

The next theorem considers a general signal structure. 


Theorem 1.9 (Non-linear estimation without tangent cone structure). Suppose
that $a_i \sim \mathcal{N}(0, I)$, $x \in S^{n-1}$, and that $y$ follows the semi-parametric
single index model of Section 1.1. Assume that $\mu x \in K$ where $K$ is
convex$^2$ and let $d_t(K) := w_t(K - \mu x)^2/t^2$. Then, the following holds with high
probability. For any $t > 0$ such that

$$m \gtrsim d_t(K),$$

the non-linear estimator $\hat{x}$ which minimizes the K-Lasso (1.1) satisfies

$$\|\hat{x} - \mu x\|_2 \lesssim \frac{\sqrt{d_t(K)}\,\sigma + \eta}{\sqrt{m}} + t, \tag{1.11}$$

where $\mu$, $\eta$, and $\sigma$ are defined in (1.5).


Note that one may derive Theorem 1.4 by taking the limit as t goes to 
zero in the above theorem. However, in the proofs we will give a simpler 
and more straightforward route to the proof of Theorem 1.4. 


$^2$More generally, the proof only requires that $K - \mu x$ be contained in a star-shaped set.
This star-shaped set can take the place of $K - \mu x$ in the results of this theorem.

Remark 1.10 (Non-trivial covariance matrix). As above, this result can be
generalized to the case when the covariance matrix of the rows is $\Sigma \ne I$.
One would just define $d_t(K, \Sigma)$ in a straightforward way similar to that in
Corollary 1.6.

1.5. Key idea in the proof. While it may be surprising that the K-Lasso
is provably accurate even under the (non-linear) single-index model, it
becomes much clearer when one observes that the expected loss, $\mathbb{E}\,\|Ax' - y\|_2^2$,
is minimized by $\mu x$. In other words, regardless of the form of the
non-linearity, the expected squared error is minimized by a multiple of the
original signal. See Section 4 for a proof.

In fact, one may transform the single-index model into a scaled linear
model with an unusual noise term. Define an induced noise vector $z$ to
satisfy

$$y = A\mu x + z.$$

One may not expect $z$ to play the role of noise, since it generally does not
have zero mean, and is not independent of $A$. However, $z_i$ is uncorrelated
with $a_i$ (see Section 4).

We note that under this scaled linear model, one could use standard
techniques to derive error bounds if $z$ were deterministic, or independent of $A$
[33], or if $z$ were sub-Gaussian. However, since we make quite mild
assumptions in our single-index model, only implicitly assuming that the parameters
$\mu$, $\sigma$, and $\eta$ are well-defined, this induced noise may have heavy tails and
requires novel analysis. Some of the tools for this analysis are available in
the recent work [37] by the current authors and Yudovina. However, this 
earlier paper did not apply to the K-Lasso, and there were many technical 
details needed to extend these results. In particular, the extra steps in the 
proof of Theorem 1.9 are new ideas, as well as the method to give results 
with non-trivial covariance matrix. We give a detailed comparison with this 
earlier work and others in the next section. 

2. Related literature 

There is now a precise and comprehensive theory of signal reconstruction 
from linear observations, which takes into account signal structure. While it 
is largely motivated by the quite modern area of compressed sensing [18, 19], 
it is rooted in results developed in the older areas of geometric functional 
analysis [47, 21] and convex integral geometry [40]. To leverage these tools, 
it is vital to assume that the measurement matrix A is random. We give 
a brief overview of the results most closely aligned with this work. The 
literature that we describe below takes A to be a matrix with independent 
Gaussian or sub-Gaussian entries. 

In the noiseless case, signal reconstruction is possible as soon as the
number of observations exceeds the manifold dimension [17]. Even in the noisy
case, there is a large pool of theory addressing signal reconstruction based
on manifold dimension [5, 50, 51, 16]. However, in the noisy case, it is
necessary to make extra structural assumptions about the set $K$ beyond
assuming that it has small manifold dimension. Otherwise, signal reconstruction
based on a number of observations comparable to the manifold dimension
can be unstable [20].

The Gaussian mean width gives an alternative measure of dimension.
When it is applicable, it leads to simpler assumptions. Indeed, as described
above, the Gaussian mean width controls the conditioning of $A$ when
restricted to a cone, as proved in Gordon's escape through the mesh theorem.
Rudelson and Vershynin [38] leveraged this result in the compressed sensing
setup, showing that the signal could be reconstructed as long as the
number of observations exceeded the squared Gaussian mean width of the
tangent cone; Stojnic continued in this line of research [41]. Chandrasekaran
et al. [11] extended this result to general convex bodies $K$. Amelunxen et
al. [4] took a different route, synthesizing tools from conic integral geometry
to give a precise phase transition for the number of observations needed
to reconstruct $x$. Their work is based on the statistical dimension, which
is roughly equivalent to the mean width, but has some extra convenient
properties (see [4]). This showed that previous results were tight. A line of
work by Thrampoulidis, Oymak, and Hassibi [33, 34, 42] concentrated on
the precise reconstruction error from noisy observations, and also considered
unconstrained versions of the K-Lasso. Our theoretical results in the
non-linear case can be seen to mirror [33, Theorem 1] in the linear
case. We state a simplified version of this theorem, specialized to Gaussian
noise (see the original theorem for a very careful treatment of constants).

Theorem 2.1. Suppose that $a_i \sim \mathcal{N}(0, I)$, $x \in S^{n-1}$, and that $y$
follows the noisy linear model (1.3). Assume that $\mu x \in K$, and let
$d(K) := w_1(D(K, x))^2$. Suppose that

$$m \gtrsim d(K).$$

Then, with high probability, the solution $\hat{x}$ of the K-Lasso (1.1) satisfies

$$\|\hat{x} - x\|_2 \lesssim \frac{\sqrt{d(K)}\,\sigma}{\sqrt{m}}.$$

Thus, one sees that our Theorem 1.4, when specialized to linear
observations, recovers this modern theory up to an absolute constant.


2.1. Prior work addressing non-linearity of the observations. There 
are also numerous works, and fields of study, addressing non-linearity. We 
describe the work that is most closely related to the present paper. 

The semiparametric single-index model that we take in this paper is well 
studied in econometrics; see the monograph [24]. Most work in this area 
is asymptotic, although recent works have considered the finite case [25, 
3, 13]. However, we believe that this literature does not address, from a 
theoretical standpoint, the gains that can be made by utilizing a general
low-dimensional structure. See [37, Section 6] for a more thorough discussion of
this literature.

In contrast, our work precisely characterizes the benefits from taking into
account low-dimensional signal structure. For example, consider the sparse
signal structure assumed in compressed sensing, in which $x$ contains at most
$s$ non-zero entries. The effective dimension is $O(s\log(n/s))$ which can be
significantly smaller than the ambient dimension, $n$. Thus, we show only
$O(s\log(n/s))$ measurements are needed to estimate $x$. Specialized to the
case of linear, noiseless measurements, our theory recovers the classic
result that $x$ may be exactly reconstructed from this number of
measurements. When non-linearity is present, the "noise" induced by modeling
non-linear measurements with linear measurements is reduced
proportionally to $s\log(n/s)/m$.

The area of 1-bit compressed sensing [1] concentrates on the case when the
non-linearity is 1-bit quantization. In other words, for $q \in \mathbb{R}$, $f(q) = \mathrm{sign}(q)$
or $f(q) = \mathrm{sign}(q + z)$ where $z$ is noise. This has been a lively field of research
for several years, in part due to a wide range of applicability in both signal
processing problems and also statistical models in which the data is
inherently binary. The discrete nature of this problem has led to new challenges
that were not inherent in unquantized compressed sensing. Indeed, even the
method of reconstruction of the signal has posed a challenge, and some of
the proposed methods, such as the approach of [37], require knowledge of the
covariance of the rows to be accurate. We believe our paper provides the
first analysis of the K-Lasso for this problem, and the first theoretical result
which allows non-trivial covariance of the rows of $A$. In the next section,
we specialize our work to the 1-bit compressed sensing model.

While there are numerous other publications which relate to various forms
of non-linearity and low-dimensionality, there are three papers which we
believe are most closely related to our results [37, 32, 28]. All three papers
address a general low-dimensional signal set $K$ combined with general
non-linearity. Our current result builds on the work in [37], which considers
a very similar model. There are two significant extensions that we make
beyond this work. First, our results are tighter in the sense that when
specialized to the linear model, they match modern theory which is developed
specifically for the linear model (see above). This is only true in [37] when
the noise is larger than the signal. Further, as discussed above, the method
espoused in [37] is not the K-Lasso, and requires knowledge of $\Sigma$ to be
effective.

The other two related works [32, 28] give a very general framework, which
does not focus on the K-Lasso, but can be specialized to this recovery
method. We believe that using the framework of [32], a theorem similar
to our Theorem 1.4 could be derived. A key statistical idea, which is put
rigorously in [32], is that the solution to the K-Lasso is a good estimate
of the minimizer of the expected loss. In other words, misspecification of
the model is tolerable provided that the true signal minimizes the expected
loss. See [49, Theorem 1] for a simplified version of this result. As we noted
in Section 1.5, $\mu x$ is indeed the minimizer of the expected loss; this is the
first step in our proofs, and could be used as a first step to derive error
bounds from the framework of [32]. However, the results of [32] are
general enough that such a derivation is non-trivial. Furthermore, we do not
require restricted strong convexity in our Theorem 1.9 or decomposability in
any of our theorems, which are two strong requirements of [32]. Similarly,
by observing that $\mu x$ minimizes expected loss, the results of [28] could be
specialized to the K-Lasso. This would give a result similar to our Theorem
1.9. However, our result expands upon this in two ways: 1) In [28] it is
assumed that $y_i$ is sub-Gaussian, whereas we make almost no assumption
on $y_i$; roughly, it only needs a bounded second moment. 2) In contrast to
[28], our theory takes advantage of local structure of $K$ around $\mu x$, thus
allowing, for example, the consideration of tangent cones. By doing this,
our theory re-creates classical compressed sensing results as a special case,
for example.

Finally, we would like to point to the new work [43] which considers the
unconstrained version of the K-Lasso. By considering the asymptotic regime
and adopting a stochastic model for signals $x$, the authors of [43] were able
to give a precise treatment of constants involved in the error bounds.


3. Specialization to 1-bit compressed sensing 


As discussed above, the simplest 1-bit compressed sensing model takes
the following form: for $q \in \mathbb{R}$, $f(q) = \mathrm{sign}(q)$, i.e., we just observe the
sign of the linear observations. Let $K$ be a scaling of the $\ell_1$ ball and assume
that $x$ is $s$-sparse, i.e., contains only $s$ non-zero entries. This latter
requirement implies that the tangent cone has small mean width. Indeed,
as can be seen from [11] for instance, for the appropriate scaling of $K$, one
has

$$d(K) = w_1(K - \mu x)^2 \lesssim s\log(n/s).$$

A straightforward calculation from the definitions (1.5) shows that

$$\mu = \sqrt{2/\pi}, \qquad \sigma^2 = 1 - 2/\pi, \qquad \eta^2 = 1 - 2/\pi.$$

Thus, Theorem 1.4 states that as long as $m = O(s\log(n/s))$ observations
are taken, the K-Lasso gives accuracy

$$\left\|\hat{x} - \sqrt{\tfrac{2}{\pi}}\,x\right\|_2 \lesssim \sqrt{\frac{s\log(n/s)}{m}}. \tag{3.1}$$


Moreover, this bound holds for observations with general covariance
structure. Indeed, Corollary 1.6 combined with (1.9) implies that (3.1) remains
true as long as $\Sigma$ is reasonably well conditioned.

This yields the following surprising conclusion:

Even for highly non-linear observations, such as 1-bit
quantization, the K-Lasso is quite accurate as long as the number
of observations significantly exceeds the effective dimension
of the signal.
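
As an illustration of this conclusion, the following sketch solves the K-Lasso with $K$ a scaled $\ell_1$ ball on simulated 1-bit observations. It is a minimal sketch under stated assumptions: the solver is plain projected gradient descent (one of many possible solvers, not one prescribed by the paper), the radius of the ball is set with oracle knowledge of $\mu\|x\|_1$ so that $\mu x$ lies on the boundary, and the function names, dimensions and iteration count are hypothetical choices.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {u : ||u||_1 <= radius} (sort-based method)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                  # magnitudes, descending
    css = np.cumsum(u)
    k = np.arange(1, u.size + 1)
    rho = np.nonzero(u * k > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1)       # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def k_lasso_l1(A, y, radius, num_iters=300):
    """Projected gradient descent on 0.5 * ||A x' - y||_2^2 over the l1 ball.

    Squaring and halving the loss does not change the minimizer of (1.1).
    """
    step = 1.0 / np.linalg.norm(A, 2) ** 2        # 1 / (largest singular value)^2
    x_hat = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x_hat - y)
        x_hat = project_l1_ball(x_hat - step * grad, radius)
    return x_hat

rng = np.random.default_rng(0)
n, s, m = 200, 5, 2000
x = np.zeros(n)
x[:s] = 1 / np.sqrt(s)                            # s-sparse, unit norm
A = rng.standard_normal((m, n))
y = np.sign(A @ x)                                # 1-bit observations

mu = np.sqrt(2 / np.pi)
radius = mu * np.abs(x).sum()                     # oracle scaling (assumption)
x_hat = k_lasso_l1(A, y, radius)
err = np.linalg.norm(x_hat - mu * x)
print(err, np.sqrt(s * np.log(n / s) / m))
```

The printed error can be compared with the rate on the right-hand side of (3.1); the bound holds only up to an absolute constant, so the two numbers need not match closely.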


4. Proof of main results 

We begin by setting 

z := y — Ayx. 

While z is not independent of A or x, and generally does not have mean 0, 
it will nevertheless play the role of noise. As shown in [37], z satisfies 

EA t z = 0. (4.1) 

We repeat the derivation here to keep the paper self-contained. It suffices
to show that for any $v \in S^{n-1}$, $\mathbb{E}\, v^T A^T z = 0$, which in turn would follow
from

\[ \mathbb{E}\, y_i \langle a_i, v\rangle - \mathbb{E}\, \mu \langle a_i, v\rangle \langle a_i, x\rangle = 0. \]

Since the covariance of $a_i$ is identity, the second term is equal to $\mu\langle x, v\rangle$.
To calculate the first term, note that $g_i := \langle a_i, x\rangle$ has distribution $N(0,1)$.
Then make the Gaussian decomposition $\langle a_i, v\rangle = \langle x, v\rangle g_i + g^\perp$, where $g^\perp$ is
a mean-zero Gaussian independent of $g_i$. By independence, the first term above is equal to

\[ \mathbb{E}\, y_i \langle a_i, v\rangle = \mathbb{E}\, f(g_i)\left[\langle x, v\rangle g_i + g^\perp\right] = \langle x, v\rangle\, \mathbb{E} f(g_i) g_i = \mu \langle x, v\rangle, \]

where the first equality follows from our model assumption (1.2) that $y_i =
f(g_i)$, and the last equality follows by definition of $\mu$ in (1.5). This completes
the derivation of (4.1).
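The identity (4.1) can also be sanity-checked by simulation. The following Monte Carlo sketch uses the sign non-linearity, for which $\mu = \mathbb{E}|g| = \sqrt{2/\pi}$; the unit vector `x`, the sample size and the seed are arbitrary choices of ours.

```python
import math
import random

def average_of_z_times_a(x, samples=100_000, seed=1):
    """Monte Carlo average of z_i * a_i for f = sign, which estimates E[A^T z] / m."""
    rng = random.Random(seed)
    mu = math.sqrt(2.0 / math.pi)          # mu = E[sign(g) g] = E|g| for g ~ N(0, 1)
    totals = [0.0] * len(x)
    for _ in range(samples):
        a = [rng.gauss(0.0, 1.0) for _ in x]           # one row of A
        g = sum(ai * xi for ai, xi in zip(a, x))       # g_i = <a_i, x>
        z = (1.0 if g >= 0 else -1.0) - mu * g         # z_i = f(g_i) - mu g_i
        for j, aj in enumerate(a):
            totals[j] += z * aj
    return [total / samples for total in totals]

averages = average_of_z_times_a(x=[0.6, 0.8, 0.0])
print(averages)
```

Every coordinate of the average should be close to zero, up to Monte Carlo fluctuations of order $1/\sqrt{\text{samples}}$, even though $z$ itself depends on $A$.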

Now let $\hat{x}$ be the solution of the K-Lasso (1.1), that is, the minimizer of
the loss function $\|Ax' - y\|_2$ on $K$. We may replace this loss function by

\[ L(x') := \frac{1}{m}\left( \|Ax' - y\|_2^2 - \|\mu A x - y\|_2^2 \right) \]

without affecting the minimizer $\hat{x}$. Indeed, $\mu x$ is a fixed scalar multiple of
a fixed signal, and thus we have only squared the loss function, subtracted
a constant and multiplied by $1/m$. Now, the new loss function is very
well-behaved in expectation.

Lemma 4.1 (Expected loss).

\[ \mathbb{E}\, L(x') = \|x' - \mu x\|_2^2. \]

Proof. Expanding $L(x')$, we can express it more conveniently as

\[ L(x') = \frac{1}{m}\|Ah\|_2^2 - \frac{2}{m}\langle h, A^T z\rangle \quad \text{where } h := x' - \mu x. \tag{4.2} \]

The second term has zero mean according to (4.1). Since the covariance
matrix of $a_i$ is identity, the first term is $\|h\|_2^2$ in expectation, as desired. □
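For readers who want the intermediate algebra behind (4.2), one way to carry out the expansion, using only $z = y - \mu A x$ and $h = x' - \mu x$, is sketched below. The last line uses that the rows $a_i$ of $A$ have identity covariance and that $h$ is deterministic for a fixed $x'$.

```latex
\begin{align*}
Ax' - y &= A(h + \mu x) - y = Ah - z, \qquad \mu A x - y = -z, \\
L(x') &= \frac{1}{m}\left( \|Ah - z\|_2^2 - \|z\|_2^2 \right)
       = \frac{1}{m}\|Ah\|_2^2 - \frac{2}{m}\langle Ah, z\rangle
       = \frac{1}{m}\|Ah\|_2^2 - \frac{2}{m}\langle h, A^T z\rangle, \\
\mathbb{E}\,\frac{1}{m}\|Ah\|_2^2 &= \frac{1}{m}\sum_{i=1}^m \mathbb{E}\,\langle a_i, h\rangle^2 = \|h\|_2^2 .
\end{align*}
```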
Lemma 4.1 implies that $\mu x$ minimizes the expected loss. In order to prove
the main theorem, we need to control the deviation from expectation of the
two terms in the loss function (4.2).

First, we lower bound the ratio of $\frac{1}{m}\|Ah\|_2^2$ to its expected value
$\|h\|_2^2$. This can be done by applying the classical result from the work of
Gordon [21].

Theorem 4.2 (Escape through the mesh). Let $D \subset \mathbb{R}^n$ be a cone. Then

\[ \inf_{v \in D \cap S^{n-1}} \|Av\|_2 \geq \sqrt{m-1} - w_1(D) - r \tag{4.3} \]

with probability at least $1 - e^{-r^2/2}$.

Next, we control the size of $\langle h, A^T z\rangle$.

Lemma 4.3. Let $D \subset tB_2^n$, and let $z := y - \mu A x$ as before. Then

\[ \mathbb{E} \sup_{v \in D} \langle v, A^T z\rangle \leq C\,(w(D)\sigma + t\eta)\sqrt{m}. \tag{4.4} \]

Here and in the rest of the argument, C, c refer to numerical constants; 
their values may differ from instance to instance. Before proving Lemma 4.3, 
we pause to show how the lemma and Theorem 4.2 imply our main result. 

Proof of Theorem 1.4. For convenience, let us denote the spherical part of
the tangent cone by $D = D(K, \mu x) \cap S^{n-1}$. We begin by recording two
events which occur with high probability. First, under the assumptions of
our main Theorem 1.4, the escape through the mesh Theorem 4.2 implies
that the following event holds with probability at least 0.995:

\[ \text{Event 1:} \quad \inf_{v \in D} \frac{1}{\sqrt{m}}\|Av\|_2 \geq c. \]

Second, Markov’s inequality combined with Lemma 4.3 implies that the
following event holds with probability at least 0.995:

\[ \text{Event 2:} \quad \sup_{v \in D} \langle v, A^T z\rangle \leq C\,(w(D)\sigma + \eta)\sqrt{m}. \]

By the union bound, both events hold together with probability at least 0.99. 
(We note in passing that the probability of success, and also the constant C 
in the bound of Event 2 could be sharpened using concentration inequalities. 
However, this would not change our final presentation.) 
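As a rough illustration of how Event 1 can be read off Theorem 4.2: choosing $r = \sqrt{2\log 200}$ makes the failure probability $e^{-r^2/2}$ equal to $1/200 = 0.005$, and dividing (4.3) by $\sqrt{m}$ gives an explicit lower bound. The function name and the sample values of $m$ and of the mean width below are hypothetical choices of ours.

```python
import math

def escape_lower_bound(m, width, failure_prob=0.005):
    """Right-hand side of (4.3) divided by sqrt(m), with r chosen so that
    exp(-r^2 / 2) equals failure_prob."""
    r = math.sqrt(2.0 * math.log(1.0 / failure_prob))
    bound = (math.sqrt(m - 1) - width - r) / math.sqrt(m)
    return r, bound

# Hypothetical example: m = 400 observations, mean width w_1(D) = 5.
r, bound = escape_lower_bound(m=400, width=5.0)
print(r, bound)
```

The bound stays above a fixed positive constant once $\sqrt{m}$ exceeds a constant multiple of the mean width, which is the regime assumed in Theorem 1.4.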

We now show how to bound the error vector $h := \hat{x} - \mu x$ in the intersection
of these events. Since $\hat{x}$ minimizes the loss, we have

\[ L(\hat{x}) \leq L(\mu x) = 0. \]

Combine this with Equation (4.2) to give

\[ \frac{1}{m}\|Ah\|_2^2 \leq \frac{2}{m}\langle h, A^T z\rangle. \tag{4.5} \]

On the other hand, $h$ belongs to the tangent cone $D(K, \mu x)$, so $v :=
h/\|h\|_2$ belongs to its spherical part $D = D(K, \mu x) \cap S^{n-1}$. Then, by
Events 1 and 2, we have

\[ \frac{1}{m}\|Ah\|_2^2 \geq c\|h\|_2^2 \quad \text{and} \quad \langle h, A^T z\rangle \leq \|h\|_2 \cdot C\,(w(D)\sigma + \eta)\sqrt{m}. \]

Combining these two inequalities with (4.5), we obtain

\[ c\|h\|_2^2 \leq \frac{2}{m} \cdot \|h\|_2 \cdot C\,(w(D)\sigma + \eta)\sqrt{m}. \]

Simplifying this bound, we complete the proof. □
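Spelled out, the simplification amounts to dividing both sides by $c\|h\|_2$ (the case $h = 0$ being trivial). The statement of Theorem 1.4 lies outside this section, so the display below should be read as the arithmetic consequence of the last inequality, with $2C/c$ playing the role of a single numerical constant.

```latex
\[
  \|\hat{x} - \mu x\|_2 = \|h\|_2
  \;\leq\; \frac{2C}{c} \cdot \frac{w(D)\sigma + \eta}{\sqrt{m}} .
\]
```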

We now prove Lemma 4.3. 

Proof of Lemma 4.3. This proof has similar steps to the proof of Theorem
1.3 in [37]. We begin with a projection argument to (mostly) decouple $z$ from
$A$. Let $P := xx^T$ be the orthogonal projection onto the span of $x$ and let
$P^\perp := I - xx^T$ be the projection onto the orthogonal complement. Then,
convexity of the functional $\|u\|_{D^\circ} := \sup_{v \in D} \langle v, u\rangle$ leads to the following
decomposition:

\[ \mathbb{E}\|A^T z\|_{D^\circ} \leq \mathbb{E}\|P^\perp A^T z\|_{D^\circ} + \mathbb{E}\|P A^T z\|_{D^\circ} =: I + II. \]

We first control $I$. Note that, since $A$ is Gaussian, $P^\perp A^T$ is independent
from $PA^T$. It follows that $P^\perp A^T$ is also independent of $z$. Indeed, to obtain
the latter conclusion, simply note that the columns of $PA^T$ are $\langle a_i, x\rangle x$,
and the coordinates of $z$ are

\[ z_i = f(\langle a_i, x\rangle) - \mu\langle a_i, x\rangle. \tag{4.6} \]

Therefore, $P^\perp A^T z$ is distributed identically with $P^\perp \widetilde{A}^T z$, where $\widetilde{A}$ is an
independent copy of $A$ (independent also of $z$). Thus

\[ I = \mathbb{E}\|P^\perp A^T z\|_{D^\circ} = \mathbb{E}\|P^\perp \widetilde{A}^T z\|_{D^\circ} = \mathbb{E}\|(P^\perp \widetilde{A}^T + \mathbb{E}[P\widetilde{A}^T])z\|_{D^\circ}. \]

Now, by Jensen’s inequality, the last quantity is bounded by

\[ \mathbb{E}\|(P^\perp \widetilde{A}^T + P\widetilde{A}^T)z\|_{D^\circ} = \mathbb{E}\|\widetilde{A}^T z\|_{D^\circ}. \]

Now condition on $z$. Then $\widetilde{A}^T z$ has distribution $\|z\|_2 \cdot N(0, I)$. Thus

\[ I \leq \mathbb{E}\|\widetilde{A}^T z\|_{D^\circ} = \mathbb{E}\|z\|_2 \cdot w(D) \leq \sqrt{\mathbb{E}\|z\|_2^2} \cdot w(D) = \sqrt{m}\,\sigma \cdot w(D). \]

Here in the first equality we used the definition of $w(D)$; in the last equality,
we recall (4.6) and the definition of $\sigma$ from (1.5).

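We now control $II$. Note that

\[ PA^T z = \sum_{i=1}^m z_i \langle a_i, x\rangle x = \sum_{i=1}^m \xi_i \cdot x \]

where $\xi_i := z_i\langle a_i, x\rangle = [f(\langle a_i, x\rangle) - \mu\langle a_i, x\rangle]\langle a_i, x\rangle$. Thus,

\[ II = \mathbb{E}\Big\|\Big(\sum_{i=1}^m \xi_i\Big) x\Big\|_{D^\circ}. \]

Since $D \subset tB_2^n$, we have $\|x\|_{D^\circ} \leq t$. Substituting this, we obtain

\[ II \leq t\,\mathbb{E}\Big|\sum_{i=1}^m \xi_i\Big| \leq t\sqrt{\mathbb{E}\Big(\sum_{i=1}^m \xi_i\Big)^2} = t\sqrt{m\,\mathbb{E}\,\xi_1^2} = t\sqrt{m}\,\eta, \]

where the last equality follows by definition of $\eta$ from (1.5). The proof is
complete. □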


4.1. Proof of Theorem 1.9. When the error vector $h = \hat{x} - \mu x$ is not
known to belong to a cone, but rather a general set, it can no longer be
guaranteed that $h$ is not in the null space of $A$ (which was true for cones via
Gordon’s Theorem 4.2). Nevertheless, such bad behaviour generally only
occurs at tiny scales, and at large scales $A$ may be quite well conditioned
even on general sets. This idea is made rigorous in the following lemma,
which is known in the geometric functional analysis community even in
more generality, see [39, 27, 31, 30, 45]. For the sake of the reader, we will
include a proof below.

Lemma 4.4. Let $K \subset \mathbb{R}^n$ be a star shaped set.³ Let $t > 0$ and suppose
that $m \gtrsim w_t(K)^2/t^2$. Then, with probability at least $1 - 2\exp(-m/8)$, the
following holds for all $v \in K$ satisfying $\|v\|_2 \geq t$:

\[ \|Av\|_2 \geq c\sqrt{m}\,\|v\|_2. \]

Before proving this lemma, let us combine it with Lemma 4.3 to prove 
the second main result. 


Proof of Theorem 1.9. For convenience, let us denote $K_x := K - \mu x$. As
before, we begin by considering two good events, whose intersection holds
with probability at least 0.99, based on Lemma 4.4 and Lemma 4.3.

\[ \text{Event 1:} \quad \inf_{v \in K_x,\ \|v\|_2 \geq t} \frac{1}{\sqrt{m}} \frac{\|Av\|_2}{\|v\|_2} \geq c. \]

\[ \text{Event 2:} \quad \sup_{v \in K_x \cap tB_2^n} \langle v, A^T z\rangle \leq C\,(w_t(K_x)\sigma + t\eta)\sqrt{m}. \]

We now show how to bound the error vector $h := \hat{x} - \mu x$ in the intersection
of these events. As in the proof of Theorem 1.4, the fact that $\hat{x}$ minimizes
the loss implies that

\[ \frac{1}{m}\|Ah\|_2^2 \leq \frac{2}{m}\langle h, A^T z\rangle. \tag{4.7} \]

We can assume that $\|h\|_2 \geq t$, since in the opposite case the error bound
of Theorem 1.9 holds trivially. Since $h \in K_x$, the inequality of Event 1
followed by (4.7) gives

\[ c^2\|h\|_2^2 \leq \frac{2}{m}\langle h, A^T z\rangle. \tag{4.8} \]

³ A set $K$ is star shaped if it satisfies $\lambda K \subset K$ for any $0 \leq \lambda \leq 1$.
We would like to apply the inequality of Event 2, but cannot do this directly
because $\|h\|_2$ is not bounded above by $t$. Fortunately, since $K_x$ is convex
and contains the origin, $K_x$ is star shaped. Using this fact, we may massage
our bound into the form of Event 2 via a monotonicity argument.

Divide both sides of (4.8) by $\delta := \|h\|_2$. This gives

\[ c^2\delta \leq \frac{2}{m\delta}\langle h, A^T z\rangle \leq \frac{2}{m} \sup_{u \in \delta^{-1}K_x \cap B_2^n} \langle u, A^T z\rangle =: f(\delta), \tag{4.9} \]

where in the second inequality we set $u = \delta^{-1}h$ and used that $h \in K_x$ again.
Now, since $K_x$ is star shaped, $f(\delta)$ is a monotonically decreasing function.
Thus, by the assumption $\delta \geq t$, we may replace $\delta$ by $t$ in our bound, giving

\[ c^2\|h\|_2 \leq f(t) = \frac{2}{mt} \sup_{v \in K_x \cap tB_2^n} \langle v, A^T z\rangle. \]

The proof is completed by applying the inequality of Event 2. □
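To make the last step explicit, inserting the bound of Event 2 into the inequality $c^2\|h\|_2 \leq f(t)$ gives the display below. The statement of Theorem 1.9 is not reproduced in this section, so this is only the arithmetic that the final sentence of the proof refers to, with $2C/c^2$ acting as one numerical constant.

```latex
\[
  \|h\|_2 \;\leq\; \frac{2}{c^2\, m t} \cdot C\,\bigl(w_t(K_x)\sigma + t\eta\bigr)\sqrt{m}
  \;=\; \frac{2C}{c^2}\left( \frac{w_t(K_x)\,\sigma}{t\sqrt{m}} + \frac{\eta}{\sqrt{m}} \right).
\]
```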


It remains to prove Lemma 4.4. 


Proof. We begin with the following simple comparison, which follows from
the Cauchy-Schwarz inequality for all $v \in \mathbb{R}^n$:

\[ \|Av\|_2 \geq \frac{\|Av\|_1}{\sqrt{m}}. \tag{4.10} \]

Furthermore, since $K$ is star shaped, we have

\[ \inf_{v \in K,\ \|v\|_2 \geq t} \frac{\|Av\|_1}{\|v\|_2} \geq \inf_{u \in K \cap tS^{n-1}} \frac{\|Au\|_1}{t}. \tag{4.11} \]

(Indeed, $u = tv/\|v\|_2$ lies in $K$ since $t/\|v\|_2 \leq 1$ and $K$ is star shaped.)

Next, we will control $\|Au\|_1$ with an application of the following uniform
deviation inequality, which we proved in [35].


Lemma 4.5 (Uniform deviation for the $\ell_1$ norm). Let $K \subset \mathbb{R}^n$ and let
$r, t > 0$. Then, with probability at least $1 - 2\exp(-mr^2/(2t^2))$, the following
holds for all $u \in K$ satisfying $\|u\|_2 \leq t$:

\[ \left| \frac{1}{m}\|Au\|_1 - \sqrt{\frac{2}{\pi}}\,\|u\|_2 \right| \leq \frac{4 w_t(K)}{\sqrt{m}} + r. \]


Choosing $r = t/2$ in this lemma, we conclude that with probability at
least $1 - 2\exp(-m/8)$, one has

\[ \inf_{u \in K \cap tS^{n-1}} \frac{1}{m}\|Au\|_1 \geq ct \quad \text{where} \quad c = \sqrt{\frac{2}{\pi}} - \frac{4 w_t(K)}{t\sqrt{m}} - \frac{1}{2}. \tag{4.12} \]


Recalling the assumption of Lemma 4.4 that $m \gtrsim w_t(K)^2/t^2$, we see that
$c$ is bounded below by a positive absolute constant. In this case, we can
substitute the bound into (4.11) to obtain

\[ \inf_{v \in K,\ \|v\|_2 \geq t} \frac{\|Av\|_1}{\|v\|_2} \geq cm. \]

We finish the proof by an application of inequality (4.10). □
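To get a feel for how large $m$ must be for the constant in (4.12) to be positive, the following sketch evaluates $c = \sqrt{2/\pi} - 4w_t(K)/(t\sqrt{m}) - 1/2$ directly. The function names and the sample values of the local mean width, $t$ and $m$ are hypothetical; the paper only asserts that $c$ is bounded below by an absolute constant when $m$ is a large enough multiple of $w_t(K)^2/t^2$.

```python
import math

def lemma_4_4_constant(width, t, m):
    """Value of c in (4.12): sqrt(2/pi) - 4 * w_t(K) / (t * sqrt(m)) - 1/2."""
    return math.sqrt(2.0 / math.pi) - 4.0 * width / (t * math.sqrt(m)) - 0.5

def smallest_ratio_for_positive_c():
    """Value of m * t^2 / w_t(K)^2 above which c in (4.12) becomes positive."""
    gap = math.sqrt(2.0 / math.pi) - 0.5
    return (4.0 / gap) ** 2

ratio = smallest_ratio_for_positive_c()
c_small = lemma_4_4_constant(width=1.0, t=1.0, m=100)      # too few observations
c_large = lemma_4_4_constant(width=1.0, t=1.0, m=10_000)   # comfortably many
print(ratio, c_small, c_large)
```

With these explicit constants, $c$ turns positive only once $m$ exceeds roughly $180\, w_t(K)^2/t^2$, which illustrates why the hypothesis of Lemma 4.4 must hide a numerical constant.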

5. Discussion

We have analyzed the K-Lasso for signal reconstruction from the semi-parametric
single-index model. We showed that the K-Lasso solution under
the non-linear model $y_i = f(\langle a_i, x\rangle)$ behaves roughly like the K-Lasso solution
under the noisy linear model $y_i = \mu\langle a_i, x\rangle + \sigma z_i$ with $z_i \sim N(0,1)$, where
$\mu = \mu(f)$ and $\sigma = \sigma(f)$ have simple expressions; the error of the K-Lasso
is controlled by the local mean width of $K$. We hope this theoretical result
may aid researchers who use the K-Lasso in situations when the response
may not be linear. See [12] for one such implementation.

We have made some idealized assumptions in this paper, thus allowing
theoretical results that are simple to state and understand. There are many
future directions of research of both theoretical and practical interest,
particularly in softening assumptions, which we describe below.

We considered a Gaussian design matrix, A, and this allowed for a clean 
theoretical result. It is of interest to determine whether these results have 
some universality properties. Can the same kind of accuracy be expected 
for random non-Gaussian matrices? Under the linear model, universality 
results have been shown in the compressed sensing literature [15], that is, 
theoretical performance based on a Gaussian matrix is shown to empirically 
match the performance for many other kinds of matrices. However, there 
is an extra wrinkle under the single-index model: a universality result is 
impossible when x is extremely sparse [2]. When A has independent sub- 
Gaussian entries, we conjecture that the results of our paper should still 
hold, although with an extra error term that becomes large when x is very 
sparse, and shrinks towards zero if x is spread out. It is of interest to iron 
out this theory and also to determine, both theoretically and empirically, 
how far these results may extend towards general design matrices. 

Another direction of interest is robustness of the K-Lasso to model
inaccuracies. Will the K-Lasso solution remain accurate if the single-index
model is only approximately true, or if $\mu x$ does not quite reside in $K$?

Finally, these results lead to new opportunities in signal processing problems
in which the scientist has some control over the non-linearity $f$, e.g.,
for quantization (see [42]). In that case, the explicit expressions for $\mu(f)$
and $\sigma(f)$ may be tuned to optimize the error. It is of interest to identify
other such problems, aside from quantization, that can benefit from this.